Abstract


We present an exhaustive analysis of the problem of computing the relative
entropy of two probabilistic automata. We show that the problem of computing
the relative entropy of unambiguous probabilistic automata can be formulated
as a shortest-distance problem over an appropriate semiring, give efficient
exact and approximate algorithms for its computation in that case, and report
the results of experiments demonstrating the practicality of our algorithms
for very large weighted automata. We also prove that the computation of the
relative entropy of arbitrary probabilistic automata is PSPACE-complete.

The relative entropy is used in a variety of machine learning algorithms and
applications to measure the discrepancy of two distributions. We examine the
use of the symmetrized relative entropy in machine learning algorithms and
show that, contrary to what is suggested by a number of publications, the
symmetrized relative entropy is neither positive definite symmetric nor
negative definite symmetric, which limits its use and application in kernel
methods. In particular, the convergence of training for learning algorithms is
not guaranteed when the symmetrized relative entropy is used directly as a
kernel, or as the operand of an exponential as in the case of Gaussian kernels.

Finally, we show that our algorithm for the computation of the entropy of an
unambiguous probabilistic automaton can be generalized to the computation of
the norm of an unambiguous probabilistic automaton by using a monoid morphism.
In particular, this yields efficient algorithms for the computation of the
$L_p$-norm of a probabilistic automaton.


1  Introduction 

The problem of comparing two distributions arises in a variety of applications.
A specific instance of that problem is that of comparing distributions given by
probabilistic automata. Probabilistic automata are used extensively in text and
speech processing to model different aspects of language such as morphology,
phonology, or syntax [Mohri, 1997] or in other applications such as
computational biology [Durbin et al., 1998] and image processing [Culik II and
Kari, 1997].

The output of a large-vocabulary speech recognition system or that of a complex
information extraction system is often represented as a probabilistic automaton
compactly representing a large set of alternative sequences [Mohri et al.,
2002]. Natural language sequences such as documents or biological sequences can
also be modeled by probabilistic automata [Krogh et al., 1994]. The computation
of the distance or discrepancy between probabilistic automata can thus be used
to cluster the outputs of speech recognition or information extraction systems,
documents, biological sequences, or other objects modeled in a similar way.

The problem of efficiently computing the distance between two distributions
represented by weighted automata arises in many other machine learning
problems. When a weighted automaton is obtained as a result of training on a
large data set, the quality of the learning algorithm can be measured by
computing the distance between the inferred automaton and the target automaton.
Similarly, in many on-line learning algorithms and grammar inference
applications, the convergence of an iterative algorithm relies on the magnitude
of the distance between two consecutive weighted automata.

This motivates the design of efficient algorithms for the computation of the
distance or discrepancy between probabilistic automata.1 Many standard
distances or divergences are commonly used to compare distributions. They
include the relative entropy or Kullback-Leibler divergence, the $L_p$
distance, the Hellinger distance, the Jensen-Shannon distance, the
$\chi^2$-distance, and the Triangle distance between two distributions $q_1$
and $q_2$ defined over a discrete set $X$ [Topsøe, 2000, Csiszár and Körner,
1997].

In a companion paper, we give an exhaustive study of the problem of computing
the $L_p$ distance of two probabilistic automata and other similar distances
such as the Hellinger distance [Cortes et al., 2006a]. In particular, we give
efficient exact and approximate algorithms for computing these distances for
$p$ even and prove the problem to be NP-hard for all odd values of $p$, thereby
completing previously known hardness results. We also show the hardness of
approximating the $L_p$ distance of two probabilistic automata for odd values
of $p$.

This paper deals with the problem of computing the relative entropy of two
probabilistic automata. The relative entropy, or Kullback-Leibler divergence,
is one of the most commonly used measures of the discrepancy of two
distributions $p$ and $q$ [Cover and Thomas, 1991]. It is an asymmetric
difference that admits the following information-theoretical interpretation: it
measures the number of additional bits needed to encode distribution $p$ when
using an optimal code for $q$ in place of an optimal code for $p$.

One approximate solution for the computation of the relative entropy would
consist of sampling sequences from the distributions represented by each of the
automata and of using those to compute the KL-divergence by simply summing
their contributions. However, sample sizes guaranteeing a small approximation
error could be very large, which would significantly increase the computation
while still providing only an approximate solution.

We  present  an  exhaustive  analysis  of  the  problem  of  computing  the  relative 
entropy  of  two  probabilistic  automata.  We  show  that  the  problem  of  computing 
the  relative  entropy  of  unambiguous  probabilistic  automata  can  be  formulated 
as  a  shortest-distance  problem  over  an  appropriate  semiring,  give  efficient  exact 
and  approximate  algorithms  for  its  computation  in  that  case,  and  report  the 
results  of  experiments  demonstrating  the  practicality  of  our  algorithms  for  very 
large  weighted  automata.  We  also  prove  that  the  computation  of  the  relative 
entropy  of  arbitrary  probabilistic  automata  is  PSPACE-complete. 

A procedure for the approximate computation of the relative entropy was given
by Carrasco [1997]. The procedure applies to deterministic weighted automata
and cannot be generalized to the case of unambiguous weighted automata because
of the specific sum decomposition it is based on (the partitioning assumed in
[Carrasco, 1997, eq. 15, page 6] does not hold for unambiguous automata). Our
algorithms apply to the larger class of unambiguous weighted automata. For some
unambiguous weighted automata, the size of any equivalent deterministic
weighted automaton is exponentially larger. Since the size of the machine
directly affects the complexity of the computation, it is important to be able
to compute the entropy directly from the unambiguous automaton. We give the
first exact algorithms for the computation of the relative entropy. We also
describe approximate algorithms that are conceptually simpler than the
procedure of Carrasco [1997] and have a better time and space complexity.

1 A related problem is that of testing the equivalence of two arbitrary
probabilistic automata $A_1$ and $A_2$. In [Cortes et al., 2006b,a], we give an
efficient algorithm for this problem whose time complexity is
$O(|\Sigma| (|A_1| + |A_2|)^3)$.

The relative entropy is used in a variety of machine learning algorithms and
applications to measure the discrepancy of two distributions. We examine the
use of the symmetrized relative entropy in machine learning algorithms and
show that, contrary to what is suggested by a number of publications (e.g.,
[Mandel et al., 2006]), the symmetrized relative entropy is neither positive
definite symmetric nor negative definite symmetric, which limits its use and
application in kernel methods. In particular, the convergence of training for
learning algorithms is not guaranteed when the symmetrized relative entropy is
used directly as a kernel, or as the operand of an exponential as in the case
of Gaussian kernels [Schölkopf and Smola, 2002].

Finally,  we  show  that  our  algorithm  for  the  computation  of  the  entropy  of 
an  unambiguous  probabilistic  automaton  can  be  generalized  to  the  computation 
of  the  norm  of  an  unambiguous  probabilistic  automaton  by  using  a  monoid 
morphism  [Cortes  et  al.,  2006b].  In  particular,  this  yields  efficient  algorithms 
for  the  computation  of  the  Lp-norm  of  a  probabilistic  automaton. 

The paper is organized as follows. Section 2 introduces the preliminary
semiring and automata definitions used in the remainder of the paper. Section 3
recalls the definition of the relative entropy of two probabilistic automata
and introduces a semiring, the entropy semiring, which helps formulate the
computation of the relative entropy of unambiguous probabilistic automata as a
shortest-distance problem. Section 4 describes both an exact and a fast
approximate algorithm for the computation of the relative entropy of
unambiguous probabilistic automata. It also provides a detailed analysis of
these algorithms and reports the results of experiments with large weighted
automata. The case of arbitrary probabilistic automata is treated in Section 5,
where the problem is proven to be PSPACE-complete. Section 6 proves several
negative results for the use of the symmetrized relative entropy in kernel
methods. It proves that the symmetrized relative entropy is neither positive
definite nor negative definite. Finally, Section 7 extends our algorithm for
the computation of the entropy of a probabilistic automaton to the computation
of other norms defined via a monoid morphism.
2  Preliminaries


2.1  Semirings  and  Weighted  Automata 

Weighted automata are automata in which each transition carries some weight in
addition to the usual alphabet symbol [Eilenberg, 1974-1976, Salomaa and
Soittola, 1978, Berstel and Reutenauer, 1988]. For various operations to be
well-defined, the weight set must have the algebraic structure of a semiring
[Kuich and Salomaa, 1986]. A semiring is a ring that may lack negation.

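Definition 1. A semiring is a system $(\mathbb{K}, \oplus, \otimes, 0, 1)$ such that:

•  $(\mathbb{K}, \oplus, 0)$ is a commutative monoid with $0$ as the identity element for $\oplus$,

•  $(\mathbb{K}, \otimes, 1)$ is a monoid with $1$ as the identity element for $\otimes$,

•  $\otimes$ distributes over $\oplus$: for all $a, b, c$ in $\mathbb{K}$,

$(a \oplus b) \otimes c = (a \otimes c) \oplus (b \otimes c)$ and $c \otimes (a \oplus b) = (c \otimes a) \oplus (c \otimes b)$,

•  $0$ is an annihilator for $\otimes$: $\forall a \in \mathbb{K},\ a \otimes 0 = 0 \otimes a = 0$.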

Some familiar semirings are the Boolean semiring $(\{0, 1\}, \vee, \wedge, 0, 1)$
or the tropical semiring $(\mathbb{R}_+ \cup \{\infty\}, \min, +, \infty, 0)$
related to classical shortest-paths problems and algorithms. A semiring is
idempotent if for all $a \in \mathbb{K}$, $a \oplus a = a$. It is commutative
when $\otimes$ is commutative.
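As an illustration of Definition 1, the following Python sketch represents a semiring by its two operations and two identity elements and spot-checks the axioms on a finite sample. The names `Semiring`, `BOOLEAN`, `TROPICAL`, and `check_axioms` are hypothetical and introduced only for this example; checking a few sample values is of course not a proof.

```python
import math
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Semiring(Generic[T]):
    plus: Callable[[T, T], T]    # the "oplus" operation
    times: Callable[[T, T], T]   # the "otimes" operation
    zero: T                      # identity for plus, annihilator for times
    one: T                       # identity for times


# The two familiar semirings mentioned in the text.
BOOLEAN = Semiring(lambda a, b: a or b, lambda a, b: a and b, False, True)
TROPICAL = Semiring(min, lambda a, b: a + b, math.inf, 0.0)


def check_axioms(sr, samples):
    """Spot-check the semiring axioms of Definition 1 on a finite sample."""
    for a in samples:
        assert sr.plus(a, sr.zero) == a
        assert sr.times(a, sr.one) == a == sr.times(sr.one, a)
        assert sr.times(a, sr.zero) == sr.zero == sr.times(sr.zero, a)
        for b in samples:
            assert sr.plus(a, b) == sr.plus(b, a)
            for c in samples:
                left = sr.times(sr.plus(a, b), c)
                assert left == sr.plus(sr.times(a, c), sr.times(b, c))
                right = sr.times(c, sr.plus(a, b))
                assert right == sr.plus(sr.times(c, a), sr.times(c, b))
    return True
```

Both example semirings are idempotent (for instance `TROPICAL.plus(3.0, 3.0)` returns `3.0`), which is not the case for the semirings needed later for the entropy computation.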

Definition 2. A weighted automaton $A = (\Sigma, Q, I, F, E, \lambda, \rho)$ over a semiring
$(\mathbb{K}, \oplus, \otimes, 0, 1)$ is a 7-tuple where:

•  $\Sigma$ is the finite alphabet of the automaton,

•  $Q$ is a finite set of states,

•  $I \subseteq Q$ the set of initial states,

•  $F \subseteq Q$ the set of final states,

•  $E \subseteq Q \times (\Sigma \cup \{\epsilon\}) \times \mathbb{K} \times Q$ a finite set of transitions,

•  $\lambda : I \to \mathbb{K}$ the initial weight function mapping $I$ to $\mathbb{K}$, and

•  $\rho : F \to \mathbb{K}$ the final weight function mapping $F$ to $\mathbb{K}$.

The weighted automata considered in this paper are assumed not to contain
$\epsilon$-transitions. A pre-processing $\epsilon$-removal algorithm can be
used to remove such transitions for the automata considered here [Mohri,
2002a]. Furthermore, it is assumed that the automata are trim, i.e., all states
in the automata are both accessible and co-accessible.

We denote by $|A| = |E| + |Q|$ the size of an automaton
$A = (\Sigma, Q, I, F, E, \lambda, \rho)$, that is the sum of the number of
states and transitions of $A$. Given a transition $e \in E$, we denote by
$i[e]$ its input label, $p[e]$ its origin or previous state, $n[e]$ its
destination or next state, and $w[e]$ its weight (weighted automata case).
Given a state $q \in Q$, we denote by $E[q]$ the set of transitions leaving $q$.

Figure 1: An unambiguous weighted finite automaton that cannot be
determinized. 0 is the initial state and 1 the final state. The automaton
accepts the set of strings (a*b*)*ab*.

A path $\pi = e_1 \cdots e_k$ in $A$ is an element of $E^*$ with consecutive
transitions: $n[e_{i-1}] = p[e_i]$, $i = 2, \ldots, k$. We extend $n$ and $p$
to paths by setting: $n[\pi] = n[e_k]$ and $p[\pi] = p[e_1]$. We denote by
$P(q, q')$ the set of paths from $q$ to $q'$ and by $P(q, x, q')$ the set of
paths from $q$ to $q'$ with input label $x \in \Sigma^*$. The labeling function
$i$ and the weight function $w$ can also be extended to paths by defining the
label of a path as the concatenation of the labels of its constituent
transitions, and the weight of a path as the $\otimes$-product of the weights
of its constituent transitions: $i[\pi] = i[e_1] \cdots i[e_k]$,
$w[\pi] = w[e_1] \otimes \cdots \otimes w[e_k]$.

The output weight associated by an automaton $A$ to an input string
$x \in \Sigma^*$ is defined by:

\[ [\![A]\!](x) = \bigoplus_{\pi \in P(I, x, F)} \lambda[p[\pi]] \otimes w[\pi] \otimes \rho[n[\pi]]. \qquad (1) \]
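To make Equation 1 concrete, here is a minimal Python sketch over the probability semiring, where $\oplus$ is ordinary addition and $\otimes$ is ordinary multiplication. Rather than enumerating paths, it propagates weights symbol by symbol, which yields the same sum by distributivity. The function name `output_weight`, the tuple encoding of transitions, and the one-state example automaton are assumptions made for this illustration.

```python
def output_weight(transitions, initial, final, x):
    """Return [[A]](x) over the probability semiring (R+, +, *, 0, 1).

    transitions: list of (source, label, weight, destination)
    initial:     dict mapping initial states to initial weights (lambda)
    final:       dict mapping final states to final weights (rho)
    """
    # forward[q] = sum over paths reading the prefix consumed so far and
    # ending in q of (initial weight) * (path weight).
    forward = dict(initial)
    for symbol in x:
        nxt = {}
        for (src, label, weight, dst) in transitions:
            if label == symbol and src in forward:
                nxt[dst] = nxt.get(dst, 0.0) + forward[src] * weight
        forward = nxt
    # Multiply by the final weights and add up over the final states reached.
    return sum(w * final[q] for q, w in forward.items() if q in final)


# Hypothetical one-state automaton: Pr[a^n] = (1/2)^(n+1).
TRANSITIONS = [(0, "a", 0.5, 0)]
INITIAL = {0: 1.0}
FINAL = {0: 0.5}
```

With this encoding, `output_weight(TRANSITIONS, INITIAL, FINAL, "aa")` evaluates to `0.125`, and any string containing a symbol other than `a` receives weight 0.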


2.2  Deterministic and Unambiguous Weighted Automata

A weighted automaton $A$ is said to be deterministic or subsequential if it has
a deterministic input, that is if it has a unique initial state and if no two
transitions leaving the same state share the same input label. A weighted
automaton is said to be unambiguous if for any $x \in \Sigma^*$ it admits at
most one accepting path labeled with $x$. Thus, the class of unambiguous
weighted automata includes deterministic weighted automata.

Fig. 1 shows an unambiguous weighted automaton that does not admit an
equivalent deterministic weighted automaton. Previous work on the computation
of the relative entropy [Carrasco, 1997] was limited to deterministic finite
automata. We present the first algorithms for the computation of the relative
entropy of unambiguous weighted automata.

2.3  Shortest-Distances 

Let $s[A]$ denote the $\oplus$-sum of the weights of all successful paths of
$A$ when it is defined and in $\mathbb{K}$. $s[A]$ can be viewed as the
shortest-distance from the initial states to the final states. When the sum of
the weights of all paths from any state $p$ to any state $q$ is well-defined
and in $\mathbb{K}$, we can define the shortest distance from $p \in Q$ to
$q \in Q$ as:

\[ d[p, q] = \bigoplus_{\pi \in P(p, q)} w[\pi], \qquad (2) \]

where the summation is defined to be $0$ when $P(p, q) = \emptyset$.

2.4  Probabilistic  Automata 

Definition 3. A weighted automaton $A$ defined over the probability semiring
$(\mathbb{R}_+, +, \times, 0, 1)$ is said to be probabilistic if for any state
$q \in Q$, $\bigoplus_{\pi \in P(q, q)} w[\pi]$, the sum of the weights of all
cycles at $q$, is well-defined and in $\mathbb{R}_+$ and

\[ \sum_{x \in \Sigma^*} [\![A]\!](x) = 1. \qquad (3) \]

A probabilistic automaton $A$ is said to be stochastic if at each state the
weights of the outgoing transitions and the final weight sum to one.

Note that our definition of probabilistic automata differs from that of Rabin
[1963] and Paz [1971]. Probabilistic automata as defined by these authors are
weighted automata over $(\mathbb{R}_+, +, \times, 0, 1)$ such that at any state
$q$ and for any label $a \in \Sigma$, the weights of the outgoing transitions
of $q$ labeled with $a$ sum to one. More generally, with that definition, the
weights of the paths leaving state $q$ and labeled with $x \in \Sigma^*$ sum to
one. Such automata define a conditional probability distribution
$\Pr[q' \mid q, x]$ over all states $q'$ that can be reached from $q$ by
reading $x$.

Instead, with our definition, probabilistic automata represent distributions
over $\Sigma^*$, $\Pr[x]$, $x \in \Sigma^*$. These are the natural
distributions that arise in many applications. They are inferred from large
data sets using statistical learning techniques. We are interested in computing
the relative entropy of two such distributions over strings.

2.5  Intersection  of  Weighted  Automata 

Let $A_1$ and $A_2$ be two weighted automata with
$A_i = (\Sigma, Q_i, I_i, F_i, E_i, \lambda_i, \rho_i)$ for $i = 1, 2$. The
intersection $A$ of $A_1$ and $A_2$ is denoted by $A = A_1 \cap A_2$. It is a
weighted automaton accepting the language $L(A_1) \cap L(A_2)$ and defined by
the tuple
$A = (\Sigma, Q_1 \times Q_2, I_1 \times I_2, F_1 \times F_2, E, (\lambda_1, \lambda_2), (\rho_1, \rho_2))$,
where the transitions $E$ are defined according to the following rule:

\[ (q_1, a, w_1, q_2) \in E_1 \text{ and } (q_1', a, w_1', q_2') \in E_2 \implies ((q_1, q_1'), a, (q_2, q_2')) \in E. \]

There exists a general algorithm for the computation of the intersection over
an arbitrary semiring, even in the presence of $\epsilon$-transitions [Mohri et
al., 1996]. The time complexity of the algorithm is quadratic,
$O(|A_1| |A_2|)$, since in the worst case the outgoing transitions of each
state of $A_1$ match all those of each state of $A_2$.
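The matching rule above can be sketched in a few lines of Python for $\epsilon$-free automata. This is only an illustration, not the general algorithm of [Mohri et al., 1996]: it pairs every transition of one automaton with every transition of the other, without restricting to accessible states. The function name `intersect` is hypothetical, and combining the weights of matched transitions with a `times` argument is an assumption of this sketch, since the rule as stated only specifies which states and labels are paired.

```python
from itertools import product


def intersect(trans1, trans2, times=lambda a, b: a * b):
    """Pair up transitions of two epsilon-free automata sharing an input label.

    Each transition is a tuple (source, label, weight, destination).
    The `times` argument (an assumption of this sketch) says how the two
    weights of a matched pair are combined.
    """
    result = []
    for (q1, a, w1, q2), (r1, b, w2, r2) in product(trans1, trans2):
        if a == b:
            result.append(((q1, r1), a, times(w1, w2), (q2, r2)))
    return result


# Two small hypothetical transition sets.
E1 = [(0, "a", 0.5, 1), (0, "b", 0.5, 1)]
E2 = [(0, "a", 0.25, 0), (0, "a", 0.75, 1)]
```

The double loop makes the quadratic worst case visible: every pair of transitions is examined once, and in the worst case every pair matches.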
3  Relative Entropy

The problem that we are interested in is that of computing $D(A \| B)$, the
relative entropy of two unambiguous probabilistic automata $A$ and $B$.

3.1  Definition 

The entropy $H(p)$ of a probability distribution $p$ defined over a discrete
set $X$ is defined as [Cover and Thomas, 1991]:

\[ H(p) = -\sum_{x \in X} p(x) \log p(x), \qquad (4) \]

where by convention $0 \log 0 = 0$. The relative entropy, or Kullback-Leibler
divergence, of two probability distributions $p$ and $q$ defined over a
discrete set $X$ is defined as:

\[ D(p \| q) = \sum_{x \in X} p(x) \log \frac{p(x)}{q(x)} = E_p\left[\log \frac{p(x)}{q(x)}\right], \qquad (5) \]

where we use the standard conventions: $0 \log \frac{0}{q} = 0$ and
$p \log \frac{p}{0} = \infty$. It is straightforward to show, using Jensen's
inequality, that the relative entropy is non-negative and that
$D(p \| q) = 0$ if and only if $p = q$. Note that the relative entropy does not
define a metric since it is not symmetric and does not satisfy the triangle
inequality.
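The definition and its two conventions can be illustrated on finite distributions with a short Python sketch. The function name `relative_entropy`, the dictionary encoding of distributions, and the use of the natural logarithm (so that values are in nats rather than bits) are choices made for this example.

```python
import math


def relative_entropy(p, q):
    """D(p || q) in nats for distributions given as dicts outcome -> probability."""
    total = 0.0
    for x, px in p.items():
        if px == 0.0:
            continue              # convention: 0 log(0/q) = 0
        qx = q.get(x, 0.0)
        if qx == 0.0:
            return math.inf       # convention: p log(p/0) = infinity
        total += px * math.log(px / qx)
    return total


# Two hypothetical distributions over the outcomes "a" and "b".
P = {"a": 0.5, "b": 0.5}
Q = {"a": 0.25, "b": 0.75}
```

Here `relative_entropy(P, Q)` and `relative_entropy(Q, P)` differ, which illustrates the lack of symmetry noted above, while `relative_entropy(P, P)` is zero.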

These definitions naturally apply to probabilistic automata since they define
distributions over strings. The relative entropy of $A$ and $B$ can be written
as the sum of two terms:2

\[ D(A \| B) = \sum_x [\![A]\!](x) \log [\![A]\!](x) - \sum_x [\![A]\!](x) \log [\![B]\!](x). \qquad (6) \]

3.2  Entropy  Semiring 

This section introduces a semiring that will be later used to formulate the
problem of computing the relative entropy of two unambiguous automata as a
single-source shortest-distance problem.

Let $\mathbb{K}$ denote
$(\mathbb{R} \cup \{+\infty, -\infty\}) \times (\mathbb{R} \cup \{+\infty, -\infty\})$.
For pairs $(x_1, y_1)$ and $(x_2, y_2)$ in $\mathbb{K}$, define the following
operations:

\[ (x_1, y_1) \oplus (x_2, y_2) = (x_1 + x_2, y_1 + y_2) \qquad (7) \]

\[ (x_1, y_1) \otimes (x_2, y_2) = (x_1 x_2, x_1 y_2 + x_2 y_1) \qquad (8) \]

Lemma 1. The system $(\mathbb{K}, \oplus, \otimes, (0, 0), (1, 0))$ defines a
commutative semiring.

Proof. It is known that $(\mathbb{K}, \oplus, (0, 0))$ is a commutative monoid
with $(0, 0)$ as the identity element for $\oplus$. Furthermore, it is clear
that $(\mathbb{K}, \otimes, (1, 0))$ is a commutative monoid with $(1, 0)$ as
the identity element for $\otimes$. Also, $(0, 0)$ is an annihilator for
$\otimes$. Thus, all that remains to be shown is that $\otimes$ distributes
over $\oplus$. Since both operations are commutative, we need to verify that
for all $z_1, z_2, z_3 \in \mathbb{K}$,

\[ (z_1 \oplus z_2) \otimes z_3 = (z_1 \otimes z_3) \oplus (z_2 \otimes z_3). \qquad (9) \]

Let $z_i = (x_i, y_i)$ for $i = 1, 2, 3$. We have

\begin{align*}
(z_1 \oplus z_2) \otimes z_3
 &= ((x_1, y_1) \oplus (x_2, y_2)) \otimes (x_3, y_3) \\
 &= (x_1 + x_2, y_1 + y_2) \otimes (x_3, y_3) \\
 &= ((x_1 + x_2) x_3, (x_1 + x_2) y_3 + x_3 (y_1 + y_2)) \\
 &= (x_1 x_3, x_1 y_3 + x_3 y_1) \oplus (x_2 x_3, x_2 y_3 + x_3 y_2) \\
 &= ((x_1, y_1) \otimes (x_3, y_3)) \oplus ((x_2, y_2) \otimes (x_3, y_3)) \\
 &= (z_1 \otimes z_3) \oplus (z_2 \otimes z_3),
\end{align*}

which ends the proof of the lemma. □

2 The first term is simply $-H(A)$, where $H(A)$ is the entropy of $A$.

Figure 2: Illustration of the completion operation.


We  call  the  semiring  just  defined  the  entropy  semiring  due  to  its  relevance 
in  the  computation  of  the  entropy  and  the  relative  entropy.  This  semiring  arises 
in  other  contexts  and  can  be  defined  in  terms  of  an  S'-module  [Bloom  and  Esik, 
1991,  Eisner,  2001]. 
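The two operations of the entropy semiring are easy to experiment with. The following Python sketch implements Equations 7 and 8 on pairs and spot-checks distributivity with exact rational arithmetic; the names `eplus`, `etimes`, `ZERO`, `ONE`, and `is_distributive` are hypothetical, and a check on sample values only complements the proof of Lemma 1.

```python
from fractions import Fraction
from itertools import product

ZERO, ONE = (0, 0), (1, 0)   # identities for the two operations


def eplus(z1, z2):
    """(x1, y1) (+) (x2, y2) = (x1 + x2, y1 + y2), as in Equation 7."""
    return (z1[0] + z2[0], z1[1] + z2[1])


def etimes(z1, z2):
    """(x1, y1) (x) (x2, y2) = (x1 x2, x1 y2 + x2 y1), as in Equation 8."""
    return (z1[0] * z2[0], z1[0] * z2[1] + z2[0] * z1[1])


def is_distributive(samples):
    """Check (a (+) b) (x) c == (a (x) c) (+) (b (x) c) on all sample triples."""
    return all(
        etimes(eplus(a, b), c) == eplus(etimes(a, c), etimes(b, c))
        for a, b, c in product(samples, repeat=3)
    )


SAMPLES = [(Fraction(1, 2), Fraction(-1, 3)), (Fraction(2), Fraction(5)), ZERO, ONE]
```

The second component of the product follows the same pattern as the derivative of a product, which is one way to remember why a weight of the form $(w, w \log w)$ multiplies along a path into $(u, u \log u)$ for the path weight $u$.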


4  Relative Entropy of Unambiguous Probabilistic Automata

This  section  describes  two  algorithms  for  computing  the  relative  entropy  of 
two  unambiguous  probabilistic  automata  using  a  single-source  shortest  distance 
over  the  entropy  semiring:  an  exact  algorithm,  and  a  more  efficient  and  practical 
approximate  algorithm.  Clearly,  these  algorithms  can  also  be  used  to  compute 
the  entropy  of  a  single  unambiguous  probabilistic  automaton. 

4.1  Semiring  Formulation 

The unambiguous weighted automata $A$ and $B$ are not necessarily complete: at
some states, there may be no outgoing transition labeled with a given element
of the alphabet $a \in \Sigma$. We can however make them complete in a way
similar to the standard construction in the unweighted case. We introduce a new
state $q_0$ with final weight 0, add self-loops with weight 0 at that state
labeled with all elements of the alphabet, and for any $a \in \Sigma$ and
$q \in Q$, add a transition from state $q$ to $q_0$ labeled with $a$ with
weight 0 when $q$ does not have an outgoing transition labeled with $a$ (see
Figure 2). This construction leads to a complete and unambiguous weighted
automaton equivalent to the original one since the transitions added all have
weight 0. The completion operation is only applied to handle the boundary case
when there exists a string $x \in \Sigma^*$ such that $[\![B]\!](x) = 0$ and
$[\![A]\!](x) \neq 0$. In this case, the completion operation ensures that the
future computation of the relative entropy would correctly lead to $\infty$.
Note that the completion operation can be done on-demand. States and
transitions can be created only when necessary for the application of other
operations. We can thus assume that $A$ and $B$ are unambiguous and complete.
At the cost of introducing a super-initial and a super-final state, we can also
assume in the following, without loss of generality, that the initial weight
$\lambda$ and the final weights $\rho(q)$ are all equal to 1 in $A$ and $B$.
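The completion construction described above can be sketched as follows in Python. The function name `complete`, the tuple encoding of transitions, and the sink state name `"q0_sink"` are assumptions of this example; the sink's final weight of 0 is implicit since final weights are not represented here. The sketch is eager, whereas the text notes that completion can also be performed on-demand.

```python
def complete(transitions, states, alphabet, sink="q0_sink"):
    """Return the transitions of a completed automaton.

    transitions: list of (source, label, weight, destination)
    A zero-weight transition to `sink` is added for every (state, label)
    pair that has no outgoing transition, and `sink` gets zero-weight
    self-loops on every label.
    """
    present = {(src, label) for (src, label, _, _) in transitions}
    added = [
        (q, a, 0.0, sink)
        for q in states
        for a in alphabet
        if (q, a) not in present
    ]
    loops = [(sink, a, 0.0, sink) for a in alphabet]
    return list(transitions) + added + loops


# Hypothetical one-state automaton over {a, b} with no transition on "b".
COMPLETED = complete([(0, "a", 0.5, 0)], states=[0], alphabet=["a", "b"])
```

Since every added transition has weight 0, any path through the sink has weight 0 and the weight assigned to each string is unchanged.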

Let $\log A$ denote the weighted automaton derived from $A$ by replacing each
weight $w \in \mathbb{R}_+$ by $\log w$ and let $\Phi_1(A)$ (resp.
$\Phi_2(A)$) denote the weighted automaton over the entropy semiring derived
from $A$ by replacing each weight $w$ by the pair $(w, 0)$ (resp. $(1, w)$).
The construction of $\log A$, $\Phi_1(A)$, or $\Phi_2(A)$ from $A$ is
straightforward and can be done in linear time.

Proposition 2. The relative entropy of $A$ and $B$ satisfies the following
identity in the entropy semiring:

\[ (0, D(A \| B)) = s[\Phi_1(A) \cap \Phi_2(\log A)] - s[\Phi_1(A) \cap \Phi_2(\log B)]. \qquad (10) \]

Thus, the relative entropy is expressed in terms of single-source
shortest-distance computations over the entropy semiring.

Proof. Since $A$ is unambiguous and complete, both $\Phi_1(A)$ and
$\Phi_2(\log A)$ are also unambiguous and complete. Thus, for a given string
$x$, there is at most one accepting path in $\Phi_1(A)$ or $\Phi_2(\log A)$
labeled with $x$. Then, by definition of intersection, the weight associated by
$\Phi_1(A) \cap \Phi_2(\log A)$ to a string $x$ is

\[ ([\![A]\!](x), 0) \otimes (1, \log [\![A]\!](x)) = ([\![A]\!](x), [\![A]\!](x) \log [\![A]\!](x)). \qquad (11) \]

Thus, the shortest-distance from the initial states to the final states in
$\Phi_1(A) \cap \Phi_2(\log A)$ is

\begin{align*}
s[\Phi_1(A) \cap \Phi_2(\log A)]
 &= \bigoplus_x ([\![A]\!](x), [\![A]\!](x) \log [\![A]\!](x)) && (12) \\
 &= \Big(\sum_x [\![A]\!](x), \sum_x [\![A]\!](x) \log [\![A]\!](x)\Big) && (13) \\
 &= \Big(1, \sum_x [\![A]\!](x) \log [\![A]\!](x)\Big). && (14)
\end{align*}

Similarly, we can show that

\[ s[\Phi_1(A) \cap \Phi_2(\log B)] = \Big(1, \sum_x [\![A]\!](x) \log [\![B]\!](x)\Big). \qquad (15) \]

The statement of the proposition follows directly from the identities 14 and 15
and Equation 6. □

Thus,  the  computation  of  the  relative  entropy  is  reduced  to  two  single-source 
shortest-distance  computations  over  the  entropy  semiring.  The  next  section 
discusses  two  general  algorithms  for  computing  these  distances.  Since  the  first 
term  simply  corresponds  to  the  entropy  of  a  single  unambiguous  probabilistic 
automaton,  our  results  clearly  also  apply  to  the  computation  of  the  entropy. 
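The identity of Proposition 2 can be checked numerically on a toy case. The Python sketch below does not build automata: it assumes two unambiguous acyclic automata are summarized by the finite string distributions they define (the hypothetical dictionaries `A` and `B`, with strictly positive weights on the same strings), and evaluates the two $\oplus$-sums string by string, using the natural logarithm. All function names are introduced for this example only.

```python
import math


def eplus(z1, z2):
    """Entropy semiring sum: componentwise addition."""
    return (z1[0] + z2[0], z1[1] + z2[1])


def etimes(z1, z2):
    """Entropy semiring product: (x1 x2, x1 y2 + x2 y1)."""
    return (z1[0] * z2[0], z1[0] * z2[1] + z2[0] * z1[1])


def shortest_distance(dist_a, dist_log):
    """Sum over strings x of ([[A]](x), 0) (x) (1, log dist_log(x))."""
    total = (0.0, 0.0)
    for x, ax in dist_a.items():
        total = eplus(total, etimes((ax, 0.0), (1.0, math.log(dist_log[x]))))
    return total


def relative_entropy_via_semiring(dist_a, dist_b):
    """Componentwise difference of the two sums; expected to be (0, D(A||B))."""
    s_aa = shortest_distance(dist_a, dist_a)
    s_ab = shortest_distance(dist_a, dist_b)
    return (s_aa[0] - s_ab[0], s_aa[1] - s_ab[1])


# Hypothetical distributions over three strings.
A = {"a": 0.5, "ab": 0.25, "abb": 0.25}
B = {"a": 0.25, "ab": 0.25, "abb": 0.5}
```

For these two distributions the second component agrees with the direct evaluation of Equation 6, which gives $0.25 \log 2$.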

4.2  Exact  Algorithm 

A generalization of the classical Floyd-Warshall algorithm can be used to
compute all-pairs shortest distances $d[p, q]$ ($p, q \in Q$) over a closed
semiring not necessarily idempotent [Mohri, 1998, 2002b]. This algorithm can
thus also be used to compute $s[A]$ for a weighted automaton $A$ over a
non-idempotent semiring, which is needed for our purpose.

In what follows, we assume a definition of closed semirings [Lehmann, 1977]
that is more general than the classical one used by Cormen et al. [1992] in
that it does not assume idempotence. This is because idempotence is not
necessary for the proof of the correctness of the generic all-pairs
shortest-distance algorithms of Floyd-Warshall and Gauss-Jordan [Mohri, 1998,
2002b]. More generally, given a graph or automaton $A$, we introduce the
following definition.

Definition  4.  A  semiring  is  closed  for  A  if  the  infinite  sum  (closure)  is  defined 
for  any  cycle  weight  c  of  A  and  if  associativity,  commutativity,  and  distributivity 
apply  to  countable  sums  of  cycle  weights. 

Clearly, the generic Floyd-Warshall algorithm can also be applied to any
automaton $A$ for which the semiring considered is closed. The following lemma
shows that the entropy semiring has the desired property.

Lemma 3. Let $A$ be a weighted automaton over the entropy semiring such that
for any cycle weight $w = (x, y)$, $0 < x < 1$. Then, the entropy semiring is
closed for $A$.

Proof. For any $(x, y) \in \mathbb{K}$ and $k > 0$, define $R_k$ as:

\[ R_k = \underbrace{(x, y) \otimes \cdots \otimes (x, y)}_{k \text{ times}}, \qquad (16) \]

with $R_0 = (1, 0)$. We can show by induction that
$R_k = (x^k, k y x^{k-1})$. The base case is readily established for $k = 0$.
Assume that the hypothesis holds for all $i < k$. Then

\begin{align*}
R_k &= R_{k-1} \otimes (x, y) \\
    &= (x^{k-1}, (k - 1) y x^{k-2}) \otimes (x, y) \\
    &= (x^k, k y x^{k-1}). && (17)
\end{align*}

For $N > 0$, define $S_N$ by: $S_N = \bigoplus_{i=0}^{N} R_i$. It is easy to
prove by induction as above that $S_N$ verifies

\[ S_N = \left( \frac{1 - x^{N+1}}{1 - x},\ y \left( \frac{1 - x^N}{(1 - x)^2} - \frac{N x^N}{1 - x} \right) \right). \qquad (18) \]

Thus, for $0 < x < 1$, the closure of $(x, y)$ is well-defined and in
$\mathbb{K}$:3

\[ (x, y)^* = \lim_{N \to \infty} S_N = \left( \frac{1}{1 - x}, \frac{y}{(1 - x)^2} \right). \qquad (19) \]

The associativity, commutativity, and distributivity properties follow from the
associativity, commutativity, and distributivity of the sums $S_N$ with other
elements of the entropy semiring and the corresponding properties of their
pointwise limits. □
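The closed form of the closure can be compared numerically with the partial sums $S_N$ used in the proof. The Python sketch below is only a sanity check under the stated convergence condition; the names `closure` and `partial_sum` are hypothetical, and the sketch also accepts $x = 0$, for which the series trivially reduces to $(1, y)$.

```python
def eplus(z1, z2):
    """Entropy semiring sum: componentwise addition."""
    return (z1[0] + z2[0], z1[1] + z2[1])


def etimes(z1, z2):
    """Entropy semiring product: (x1 x2, x1 y2 + x2 y1)."""
    return (z1[0] * z2[0], z1[0] * z2[1] + z2[0] * z1[1])


def closure(z):
    """(x, y)* = (1 / (1 - x), y / (1 - x)^2), the limit of the partial sums."""
    x, y = z
    if not 0 <= x < 1:
        raise ValueError("the closure is only computed here for 0 <= x < 1")
    return (1.0 / (1.0 - x), y / (1.0 - x) ** 2)


def partial_sum(z, n):
    """S_N = R_0 (+) R_1 (+) ... (+) R_N, with R_k the k-fold product of z."""
    total, power = (0.0, 0.0), (1.0, 0.0)   # power starts at R_0 = (1, 0)
    for _ in range(n + 1):
        total = eplus(total, power)
        power = etimes(power, z)
    return total


Z = (0.5, -0.3)   # a hypothetical cycle weight
```

For `Z`, `partial_sum(Z, 200)` is numerically indistinguishable from `closure(Z)`, and `partial_sum(Z, 1)` equals $R_0 \oplus R_1 = (1.5, -0.3)$.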


Let $A$ be a probabilistic automaton. Then the weight $u$ of a cycle must
verify $0 < u < 1$; otherwise the automaton is not closed. The weight of a
cycle of $\Phi_1(A) \cap \Phi_2(\log A)$ is $(u, u \log u)$ (see Equation 11),
where $u$ is the weight of a cycle of $A$, and similarly, the weight of a cycle
of $\Phi_1(A) \cap \Phi_2(\log B)$ is of the form $(u, u \log v)$, where $v$ is
the weight of a matching cycle in $B$.

Thus, the entropy semiring is closed both for $\Phi_1(A) \cap \Phi_2(\log B)$
and $\Phi_1(A) \cap \Phi_2(\log A)$ and the generic Floyd-Warshall algorithm
can be applied to compute the shortest-distances
$s[\Phi_1(A) \cap \Phi_2(\log B)]$ and $s[\Phi_1(A) \cap \Phi_2(\log A)]$.

The generic Floyd-Warshall algorithm admits an in-place implementation [Mohri,
1998]; the following gives the corresponding pseudocode.

    for $i \leftarrow 1$ to $|Q|$
        do for $j \leftarrow 1$ to $|Q|$
            do $d[i, j] \leftarrow \bigoplus_{e \in P(i, j)} w[e]$
    for $k \leftarrow 1$ to $|Q|$
        do for $i \leftarrow 1$ to $|Q|$
            do for $j \leftarrow 1$ to $|Q|$
                do $d[i, j] \leftarrow d[i, j] \oplus (d[i, k] \otimes d[k, k]^* \otimes d[k, j])$
    return $d$

3 The right-hand side can be written as $(x^*, y (x^*)^2)$, if we denote by $x^* = \sum_{n=0}^{\infty} x^n$.
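The following Python sketch shows one way the generic all-pairs recurrence can be run over the entropy semiring. It is not the in-place variant: each pass builds a fresh matrix so that every update reads the values from before the pass, which trades memory for simplicity. The function names, the edge encoding, and the two-state example (a loop of weight 1/2 at state 0 and a transition of weight 1/2 to a super-final state 1, with each weight $w$ already mapped to $(w, w \log w)$) are assumptions of this illustration.

```python
import math


def eplus(z1, z2):
    return (z1[0] + z2[0], z1[1] + z2[1])


def etimes(z1, z2):
    return (z1[0] * z2[0], z1[0] * z2[1] + z2[0] * z1[1])


def closure(z):
    """(x, y)* = (1 / (1 - x), y / (1 - x)^2); x = 0 covers states without cycles."""
    x, y = z
    if not 0 <= x < 1:
        raise ValueError("cycle weight outside the range where the closure converges")
    return (1.0 / (1.0 - x), y / (1.0 - x) ** 2)


def all_pairs(n, edges):
    """d[i][j] = sum of the weights of all non-empty paths from i to j.

    edges: list of (source, weight_pair, destination), states are 0..n-1.
    """
    zero = (0.0, 0.0)
    d = [[zero] * n for _ in range(n)]
    for i, w, j in edges:
        d[i][j] = eplus(d[i][j], w)
    for k in range(n):
        star = closure(d[k][k])
        # The comprehension reads the old matrix; d is rebound only afterwards.
        d = [
            [eplus(d[i][j], etimes(etimes(d[i][k], star), d[k][j])) for j in range(n)]
            for i in range(n)
        ]
    return d


W = (0.5, 0.5 * math.log(0.5))        # (w, w log w) for w = 1/2
EDGES = [(0, W, 0), (0, W, 1)]        # loop at state 0, then exit to state 1
DIST = all_pairs(2, EDGES)
```

In this example the automaton assigns probability $(1/2)^{n+1}$ to the string with $n$ loop symbols followed by the exit symbol, so `DIST[0][1]` is expected to be $(1, -2 \log 2)$, that is, total mass 1 and minus the entropy of the distribution.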
The $\oplus$- and $\otimes$-operations of the entropy semiring can be performed
in constant time. For $(x, y)$ with $0 < x < 1$, the closure
$(x, y)^* = \left(\frac{1}{1 - x}, \frac{y}{(1 - x)^2}\right)$ can also be
computed in constant time. Thus, the running time complexity of the algorithm
is $\Theta(|E| + |Q|^3)$ and its space complexity is $\Omega(|Q|^2)$ when
applied to a weighted automaton $A = (\Sigma, Q, I, F, E, \lambda, \rho)$ over
the entropy semiring.

The intersection $\Phi_1(A) \cap \Phi_2(\log A)$ can be computed in linear time
$O(|A|)$ but the worst-case cost of computing $\Phi_1(A) \cap \Phi_2(\log B)$
is quadratic, $O(|A| |B|)$. The total time complexity of the computation of the
relative entropy is thus in $O(|A \cap B|^3)$. Its space complexity is in
$O(|A \cap B|^2)$.

This  provides  an  exact  algorithm  for  the  computation  of  the  relative  entropy. 
The  cubic  time  complexity  of  the  algorithm  with  respect  to  the  size  of  the 
intersection  automaton  makes  it  rather  slow  for  large  automata. 

The quadratic lower bound on the space complexity of the algorithm with respect to the size of the intersection machine makes it prohibitive for use in many applications. In text and speech processing applications, a weighted automaton may have several hundred million states and transitions. Even if $A$ has only about 100,000 states and $A \cap B$ has about the same number of states, the algorithm requires maintaining a matrix $d$ with 10 billion entries.

The  next  section  presents  an  algorithm  that  exploits  the  sparseness  of  the 
graph  and  does  not  impose  these  space  requirements. 

4.3  Approximate  Algorithm 

A generic single-source shortest-distance algorithm was presented for directed graphs defined over a $k$-closed semiring in [Mohri, 2002b]. The algorithm can be viewed as a generalization to these semirings of classical shortest-paths algorithms. This generalization is not trivial and does not require the semiring to be idempotent. The algorithm is also generic in the sense that it works with any queue discipline.

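Definition 5. Let $k \geq 0$ be an integer. A semiring $(\mathbb{K}, \oplus, \otimes, \bar{0}, \bar{1})$ is $k$-closed if:

$$\forall a \in \mathbb{K}, \quad \bigoplus_{n=0}^{k+1} a^n = \bigoplus_{n=0}^{k} a^n. \tag{20}$$

More generally, we will say that $\mathbb{K}$ is $k$-closed for a graph $G$ or automaton $A$ if Equation 20 holds for all cycle weights $a \in \mathbb{K}$.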
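To make Definition 5 concrete, here is a small Python sketch that tests Equation 20 for a single element $a$ and a given $k$. The helper names `partial_sum` and `is_k_closed_at` are illustrative. The two semirings used in the usage note (the tropical semiring $(\min, +)$ restricted to non-negative reals, and the probability semiring $(+, \times)$) are standard examples chosen here for illustration.

```python
import operator


def partial_sum(a, k, oplus, otimes, one):
    """Return one (+) a (+) a^2 (+) ... (+) a^k in the given semiring."""
    total, power = one, one
    for _ in range(k):
        power = otimes(power, a)
        total = oplus(total, power)
    return total


def is_k_closed_at(a, k, oplus, otimes, one):
    """Check Equation 20 for the single element a."""
    return (partial_sum(a, k + 1, oplus, otimes, one)
            == partial_sum(a, k, oplus, otimes, one))
```

With `oplus=min`, `otimes=operator.add` and `one=0.0`, the check succeeds with `k = 0` for any non-negative weight, since adding further powers never lowers the minimum. With `oplus=operator.add`, `otimes=operator.mul` and `one=1.0`, it fails for `a = 0.5` at every `k`, because each extra power strictly increases the sum.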

By definition, the entropy semiring is $k$-closed for any acyclic automaton $A$, and thus the generic single-source shortest-distance algorithm can be used to compute the relative entropy exactly in such cases. But, in general, the entropy semiring is not $k$-closed for a non-acyclic automaton $A$ since, by definition of $S_N$,

$$\forall k \geq 0, \quad S_{k+1} - S_k = R_{k+1} = (x^{k+1}, (k+1)\, y\, x^{k}). \tag{21}$$
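The closed form of $R_{k+1}$ in Equation 21 is the $(k+1)$th power of $(x, y)$ in the entropy semiring. Assuming the multiplication rule $(x_1, y_1) \otimes (x_2, y_2) = (x_1 x_2, x_1 y_2 + x_2 y_1)$, a short induction on $n$ recovers it:

```latex
\begin{align*}
(x, y)^{1} &= (x, y) = (x^{1}, 1 \cdot y\, x^{0}), \\
(x, y)^{n+1} &= (x^{n}, n\, y\, x^{n-1}) \otimes (x, y) \\
             &= (x^{n+1},\; x^{n} y + x \cdot n\, y\, x^{n-1}) \\
             &= (x^{n+1},\; (n+1)\, y\, x^{n}).
\end{align*}
```

Both components are nonzero whenever $x > 0$ and $y \neq 0$, which is why no finite $k$ satisfies Equation 20 for such a cycle weight.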

But, given a weighted automaton $A$ over the entropy semiring such that all cycle weights $w = (x, y)$ verify $0 < x < 1$, there exists $K_A$ sufficiently large such that for all $k \geq K_A$, $\|S_{k+1} - S_k\|_\infty \leq \epsilon$. Indeed, let $X$ denote the maximum value of $x$ for all cycles and $Y$ the maximum $|y|$. Then, by Equation 21, $\|S_{k+1} - S_k\|_\infty \leq \max(X^{k+1}, (k+1) Y X^{k})$ for all $(x, y)$, which falls below $\epsilon$ once $k$ exceeds a threshold that grows as $1/\log(1/X)$. This leads us to consider an approximate version of the generic single-source shortest-distance algorithm in non-acyclic cases, where the equality test is replaced by an $\epsilon$-equality: $u =_\epsilon v$ if $\|u - v\|_\infty \leq \epsilon$. The following gives the pseudocode of the modified algorithm.


 1  for i ← 1 to |Q|
 2      do d[i] ← r[i] ← 0̄
 3  d[s] ← r[s] ← 1̄
 4  S ← {s}
 5  while S ≠ ∅
 6      do q ← head(S)
 7         Dequeue(S)
 8         r' ← r[q]
 9         r[q] ← 0̄
10         for each e ∈ E[q]
11             do if d[n[e]] ≠_ε d[n[e]] ⊕ (r' ⊗ w[e])
12                 then d[n[e]] ← d[n[e]] ⊕ (r' ⊗ w[e])
13                      r[n[e]] ← r[n[e]] ⊕ (r' ⊗ w[e])
14                      if n[e] ∉ S
15                          then Enqueue(S, n[e])


$d[q]$ denotes the tentative shortest distance from the source $s$ to $q$. $r[q]$ keeps track of the sum of the weights added to $d[q]$ since the last queue extraction of $q$. The attribute $r$ is needed for the shortest-distance algorithm to work in non-idempotent cases. The algorithm uses a queue $S$ to store the set of states to consider for the relaxation steps of lines 11-15 [Mohri, 2002b]. Any queue discipline, e.g., FIFO, shortest-first, or topological (in the acyclic case), can be used. The test of line 11 is based on an $\epsilon$-equality.
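The pseudocode above can be sketched in Python as follows, here with a FIFO queue and weights in the entropy semiring. This is an illustrative sketch rather than the authors' code: the function names, the adjacency-list input format and the default tolerance are assumptions, and the semiring operations are the usual ones for the entropy semiring.

```python
from collections import deque

ZERO, ONE = (0.0, 0.0), (1.0, 0.0)


def oplus(a, b):
    """Entropy semiring sum."""
    return (a[0] + b[0], a[1] + b[1])


def otimes(a, b):
    """Entropy semiring product."""
    return (a[0] * b[0], a[0] * b[1] + b[0] * a[1])


def eps_equal(u, v, eps):
    """u =_eps v iff the sup-norm of u - v is at most eps."""
    return max(abs(u[0] - v[0]), abs(u[1] - v[1])) <= eps


def approx_shortest_distance(n, out_edges, s, eps=1e-12):
    """Approximate single-source shortest distances from state s.

    out_edges[q] lists (next_state, weight) pairs for the edges leaving q.
    """
    d = [ZERO] * n
    r = [ZERO] * n
    d[s] = r[s] = ONE
    queue = deque([s])
    in_queue = [False] * n
    in_queue[s] = True
    while queue:
        q = queue.popleft()
        in_queue[q] = False
        r_q, r[q] = r[q], ZERO             # weight accumulated since last visit
        for nxt, w in out_edges[q]:
            contribution = otimes(r_q, w)
            updated = oplus(d[nxt], contribution)
            if not eps_equal(d[nxt], updated, eps):    # epsilon test of line 11
                d[nxt] = updated
                r[nxt] = oplus(r[nxt], contribution)
                if not in_queue[nxt]:
                    queue.append(nxt)
                    in_queue[nxt] = True
    return d
```

Swapping the `deque` for a priority queue or a topological order changes only the order in which states are extracted, which is the sense in which the algorithm is generic in the queue discipline.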

Different queue disciplines yield different running times for our algorithm. The choice of the best queue discipline to use can be based on the structure of the two automata, which can be exploited to obtain a more efficient algorithm to compute the relative entropy. More specifically, let $Q$ and $E$ denote, respectively, the set of states and the set of edges of the intersection automaton. Further, let $N(q)$ denote the number of times a state $q$ is inserted in the queue. Then, using a Fibonacci heap with a shortest-first queue discipline (as in Dijkstra's algorithm), the complexity of the algorithm is given by:

$$O\Big(|Q| + |E| \max_{q \in Q} N(q) + \log |Q| \sum_{q \in Q} N(q)\Big). \tag{22}$$

If the underlying automata are acyclic, then using the queue discipline corresponding to the topological order yields the best time complexity, and the problem can be solved in linear time: $O(|Q| + |E|)$.

Using a breadth-first queue discipline (as in the Bellman-Ford shortest-distance algorithm), updates to the shortest-distance estimates in iteration $k$ can be formulated as $D^k = M D^{k-1}$, where $M$ is the matrix associated to the automaton, that is, the matrix representing the weighted graph defined by the automaton. Note that the matrix multiplication here is over the $\oplus$ and $\otimes$ operations of the semiring, so that $D^k[i] = \bigoplus_j M[i, j] \otimes D^{k-1}[j]$.

We now analyze the convergence rate of the approximate algorithm with the breadth-first queue discipline. Let us focus only on the first component of the distance pair. Let $M_1$ be the matrix obtained by taking the first component of each element of $M$. Assume that the matrix $M_1$ is a stochastic matrix.

By the Perron-Frobenius theorem, we know that the largest eigenvalue is 1 and has a multiplicity of 1. Furthermore, all other eigenvalues $\lambda$ are such that $|\lambda| < 1$. Using the Jordan canonical form of $M_1$, it is not hard to show that the matrix multiplication operation converges in $O(|\lambda_2|^k)$, where $\lambda_2$ is the second largest eigenvalue of $M_1$ (see [Golub and Loan, 1996] for a similar analysis). Thus, the updates in the $k$th iteration are proportional to $|\lambda_2|^k$, and they fall below $\epsilon$ for $k = \frac{\log(1/\epsilon)}{\log(1/|\lambda_2|)}$.

Plugging in this expression for $N(q)$, the overall complexity of the approximate algorithm is:

$$O\left(|Q| + (|E| + |Q|) \frac{\log(1/\epsilon)}{\log(1/|\lambda_2|)}\right). \tag{23}$$

For $\epsilon$ exponentially smaller than $|\lambda_2|$ ($\epsilon = |\lambda_2|^d$), the cost in complexity is only linear: $O(|Q| + d(|E| + |Q|))$.
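As a quick numerical illustration of Equation 23, the sketch below evaluates the iteration count $\log(1/\epsilon) / \log(1/|\lambda_2|)$ and the resulting operation count up to constant factors. The function names and the example sizes in the usage note are hypothetical.

```python
import math


def iterations_needed(eps, lambda2):
    """Real-valued k solving |lambda2|**k = eps."""
    return math.log(1.0 / eps) / math.log(1.0 / abs(lambda2))


def approx_cost(num_states, num_edges, eps, lambda2):
    """Operation count suggested by Equation 23, ignoring constant factors."""
    k = iterations_needed(eps, lambda2)
    return num_states + (num_edges + num_states) * k
```

For instance, with $|\lambda_2| = 0.5$ and $\epsilon = 0.5^{10}$, `iterations_needed` returns approximately 10, matching the case $\epsilon = |\lambda_2|^d$ with $d = 10$, and the cost for a graph with 100 states and 400 edges is about $100 + 10 \times 500$.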

It is possible to use different queue disciplines in different parts of the graph and improve the running time of the algorithm. For example, for a large graph with several strongly connected components, one can use a topological order on the component graph, with a shortest-first queue discipline in each strongly connected component [Mohri, 2002b]. If there are $k$ strongly connected components, with the $i$th component having $n_i$ vertices, then the running time is given by $O\big(|Q| + |E| \max_{q \in Q} N(q) + \log(\max_i n_i) \sum_{q \in Q} N(q)\big)$. If the largest component has $O(n/k)$ vertices, where $n = |Q|$, then this improves the general complexity by an additive term of $\sum_{q \in Q} N(q) \log k$. Our experience with such computations for very large graphs of several million states shows that the generic topological order with the shortest-first queue discipline within each strongly connected component often leads to the most efficient results in practice.


4.4  Comparison  with  Previous  Work 

In [Carrasco, 1997], the author describes a procedure for an approximate computation of the relative entropy of two deterministic stochastic automata. The procedure is based on an iterative method (which can be viewed as approximating the inverse of a matrix) for computing, for a stochastic automaton $A$, the probability of each state $q$, that is the sum of the weights of all paths going through $q$. The convergence is claimed but not proved and no bound is indicated on the maximum number of iterations.

The author reports no complexity result for the procedure described, which makes it difficult to compare with our algorithm. Our most favorable estimate of its complexity is $\Omega(|A|^2 |B|^2 (T + |\Sigma|))$, where $T$ denotes the maximum number of iterations executed. This is because the procedure requires using a matrix of size $|A|^2 |B|^2$. The complexity of the procedure also depends on the size of the alphabet, which, in some applications such as natural language processing, may be very large. Furthermore, the lower bound on the space complexity of this procedure is $\Omega(|A|^2 |B|^2)$. This makes it unsuitable for computing the relative entropy of large weighted automata. Note that the experiments reported by the author were carried out with very small grammars of about 30 rules. Nevertheless, the procedure bears some resemblance to our approximate algorithm. It can be viewed as an alphabet-dependent non-sparse implementation of that algorithm for the particular case of a FIFO queue discipline.

4.5  Experiments 

We implemented both the generic Floyd-Warshall algorithm and the approximate algorithm for the computation of the relative entropy of unambiguous probabilistic automata.

To avoid the numerical instability issues related to the multiplication of probabilities, we used negative log probabilities instead. This corresponds to taking the image of the entropy semiring by the semiring morphism $\log \times I$, where $I$ is the identity over the second element of the weights.

To evaluate the efficiency of our approximate algorithm for computing the relative entropy, we created two $n$-gram statistical models trained on a large corpus: a bigram model ($n = 2$) and a trigram model ($n = 3$). The minimal deterministic weighted automaton representing the bigram model had about 200,000 transitions, that of the trigram model about 400,000 transitions. It took about 3s on a single 2GHz Intel processor with 128MB of RAM to compute the relative entropy of these large weighted automata using a FIFO queue discipline. With a shortest-first queue discipline, the time was reduced to 2s.


5  Relative  Entropy  of  Arbitrary  Probabilistic 
Automata 

This  section  proves  a  hardness  result  suggesting  that  the  problem  of  computing 
the  relative  entropy  of  arbitrary  probabilistic  automata  is  intractable. 

5.1  Hardness  Result 

Automaton $A_0$. First we wish to describe the automaton $A_0$. This automaton is used in reducing the problem of determining whether the language accepted by an automaton is $\Sigma^*$ to the problem of determining whether the relative entropy of two probabilistic automata is infinite.

Figure 3: The automaton $A_0$ that accepts all strings, $\{a, b\}^*$, and assigns a weight of $\alpha^n (1 - 2\alpha)$ to any string of length $n$. $\alpha > 0$ is a constant such that $2\alpha < 1$.

Fix a real number $\alpha > 0$ such that $\alpha |\Sigma| < 1$ and let $A_0$ be the one-state weighted automaton representing the rational power series $(1 - |\Sigma| \alpha) \left(\sum_{x \in \Sigma} \alpha x\right)^*$, as depicted in Figure 3 for $\Sigma = \{a, b\}$. By definition, $A_0$ accepts all strings $x \in \Sigma^*$ and for all $x \in \Sigma^*$, $[\![A_0]\!](x) = \alpha^{|x|} (1 - |\Sigma| \alpha)$. By construction, $A_0$ is stochastic and thus probabilistic. Here is also a direct verification:


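$$\sum_{x \in \Sigma^*} [\![A_0]\!](x) = \sum_{n=0}^{\infty} \sum_{|x| = n} \alpha^n (1 - |\Sigma| \alpha) = \sum_{n=0}^{\infty} |\Sigma|^n \alpha^n (1 - |\Sigma| \alpha) \tag{24}$$

$$= (1 - |\Sigma| \alpha) \frac{1}{1 - |\Sigma| \alpha} = 1. \tag{25}$$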
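The verification can also be checked numerically by enumerating strings up to a fixed length. The sketch below is illustrative only: the function names, the alphabet and the value of $\alpha$ in the usage note are arbitrary choices. A sum truncated at length $N$ should equal $1 - (|\Sigma| \alpha)^{N+1}$, which tends to 1.

```python
from itertools import product


def weight_A0(x, alpha, alphabet):
    """Weight alpha^{|x|} (1 - |Sigma| alpha) assigned to the string x."""
    return alpha ** len(x) * (1.0 - len(alphabet) * alpha)


def total_mass(alpha, alphabet, max_len):
    """Sum of the weights of all strings of length at most max_len."""
    return sum(
        weight_A0(x, alpha, alphabet)
        for n in range(max_len + 1)
        for x in product(alphabet, repeat=n)
    )
```

With the alphabet `"ab"` and `alpha = 0.3`, `total_mass(0.3, "ab", 12)` agrees with `1 - 0.6 ** 13` up to rounding error.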

The following theorem shows that the problem of determining the relative entropy of two arbitrary probabilistic automata is at least as hard as determining if a finite automaton accepts $\Sigma^*$.

Theorem 4. Let $A$ be an arbitrary probabilistic automaton. Then $D(A_0 \| A) < \infty$ iff $A$ accepts $\Sigma^*$.

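Proof. Assume that $[\![A]\!](x) = 0$ for some $x \in \Sigma^*$. Then, since $[\![A_0]\!](x) > 0$, $[\![A_0]\!](x) \log \frac{[\![A_0]\!](x)}{[\![A]\!](x)}$ is infinite and $D(A_0 \| A) = \infty$.

Assume now that $A$ accepts $\Sigma^*$, thus $[\![A]\!](x) > 0$ for all $x \in \Sigma^*$. Without loss of generality, we can assume $A$ to be trim. Let $E$ denote the set of transitions of $A$ and let $\delta$ denote the minimum weight of a transition: $\delta = \min_{e \in E} w[e]$. By assumption, $\delta > 0$ since the automaton $A$ is trim and probabilistic. For $x \in \Sigma^*$, $|x| = n$, $[\![A]\!](x) \geq \delta^n$. Thus

$$\forall x \in \Sigma^*, \quad \frac{[\![A_0]\!](x)}{[\![A]\!](x)} = \frac{\alpha^n (1 - |\Sigma| \alpha)}{[\![A]\!](x)} \leq (1 - |\Sigma| \alpha) \left(\frac{\alpha}{\delta}\right)^n. \tag{26}$$

It follows that:

$$\forall x \in \Sigma^*, \quad [\![A_0]\!](x) \log \frac{[\![A_0]\!](x)}{[\![A]\!](x)} \leq \alpha^n (1 - |\Sigma| \alpha) \left(n \log(\alpha/\delta) + \log(1 - |\Sigma| \alpha)\right). \tag{27}$$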

For any positive integer $N$, summing over all strings $x$ of length at most $N$, in the order of increasing $|x|$, yields:

$$\sum_{|x| \leq N} [\![A_0]\!](x) \log \frac{[\![A_0]\!](x)}{[\![A]\!](x)} = \sum_{n=0}^{N} \sum_{x : |x| = n} [\![A_0]\!](x) \log \frac{[\![A_0]\!](x)}{[\![A]\!](x)} \tag{28}$$

$$\leq \sum_{n=0}^{N} |\Sigma|^n \alpha^n (1 - |\Sigma| \alpha) \left(n \log(\alpha/\delta) + \log(1 - |\Sigma| \alpha)\right).$$


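Since $\alpha |\Sigma| < 1$, the two series in this summation, $\sum_n n \beta^n$ and $\sum_n \beta^n$ with $\beta = |\Sigma| \alpha < 1$, converge. It is straightforward to verify that for $0 < \beta < 1$, $\sum_{n=0}^{\infty} n \beta^n = \frac{\beta}{(1 - \beta)^2}$. Using this identity, we obtain the following bound on $D(A_0 \| A)$:

$$D(A_0 \| A) \leq (1 - |\Sigma| \alpha) \left( \frac{|\Sigma| \alpha \log(\alpha/\delta)}{(1 - |\Sigma| \alpha)^2} + \frac{\log(1 - |\Sigma| \alpha)}{1 - |\Sigma| \alpha} \right). \tag{29}$$

Thus $D(A_0 \| A) < \infty$. □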
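As a worked instance of the finite case, suppose (purely for illustration) that $A$ is itself a one-state automaton of the same form as $A_0$ with loop weight $\alpha'$ in place of $\alpha$. The identity $\sum_n n \beta^n = \beta / (1 - \beta)^2$ then gives the closed form $D(A_0 \| A) = \frac{\beta}{1 - \beta} \log\frac{\alpha}{\alpha'} + \log\frac{1 - \beta}{1 - \beta'}$ with $\beta = |\Sigma| \alpha$ and $\beta' = |\Sigma| \alpha'$. The sketch below compares it with a truncated series; all names and parameter values are assumptions.

```python
import math


def relative_entropy_closed_form(alpha, alpha_prime, k):
    """Closed form for two one-state automata over an alphabet of size k."""
    beta, beta_prime = k * alpha, k * alpha_prime
    return (beta / (1.0 - beta)) * math.log(alpha / alpha_prime) \
        + math.log((1.0 - beta) / (1.0 - beta_prime))


def relative_entropy_truncated(alpha, alpha_prime, k, max_len):
    """Sum over string lengths 0 .. max_len; there are k**n strings of length n."""
    total = 0.0
    for n in range(max_len + 1):
        p = alpha ** n * (1.0 - k * alpha)               # weight under A_0
        q = alpha_prime ** n * (1.0 - k * alpha_prime)   # weight under A
        total += k ** n * p * math.log(p / q)
    return total
```

For $\alpha = 0.3$, $\alpha' = 0.2$ and $|\Sigma| = 2$, both functions return the same finite value up to rounding, about 0.2027.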


Theorem 5. The problem of computing the relative entropy of two arbitrary probabilistic automata is PSPACE-complete.

Proof. The universality problem, i.e., the problem of deciding if a trim finite automaton $A$ accepts $\Sigma^*$, is PSPACE-complete [Stockmeyer and Meyer, 1973, Garey and Johnson, 1979]. The transitions of any trim finite automaton $A$ can be augmented with positive weights so that it becomes a probabilistic automaton. This can be done by weighting each outgoing transition of state $q$, or final weight if $q$ is final, by $1/n_q$, where $n_q$ is the out-degree of $q$, plus one if $q$ is final. By Theorem 4, it can be decided if a probabilistic automaton $A$ accepts all strings by computing the relative entropy $D(A_0 \| A)$ and testing its finiteness. Thus, the computation of the relative entropy can determine if a trim finite automaton $A$ accepts $\Sigma^*$. □
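The weighting step of the reduction can be sketched as follows. The representation of an automaton as a list of `(source, label, destination)` triples plus a set of final states, and the function name `make_probabilistic`, are assumptions made for this example. Exact fractions are used so that the weights leaving each state sum to exactly 1.

```python
from collections import defaultdict
from fractions import Fraction


def make_probabilistic(transitions, finals):
    """Weight each transition leaving q, and the final weight of q, by 1/n_q.

    n_q is the out-degree of q, plus one if q is final.
    Returns (weighted_transitions, final_weights).
    """
    out_degree = defaultdict(int)
    for src, _label, _dst in transitions:
        out_degree[src] += 1

    def n(q):
        return out_degree[q] + (1 if q in finals else 0)

    weighted = [(src, label, dst, Fraction(1, n(src)))
                for src, label, dst in transitions]
    final_weights = {q: Fraction(1, n(q)) for q in finals}
    return weighted, final_weights
```

For the two-state automaton with transitions `(0, 'a', 0)`, `(0, 'b', 1)`, `(1, 'a', 1)` and final state 1, every weight is 1/2, and the outgoing weights of each state (including the final weight) sum to 1.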


5.2  Remarks 

Theorem 5 suggests that the general problem of computing the relative entropy of arbitrary probabilistic automata is intractable. However, one may resort to various approximations of practical importance. An example is an approximation based on the use of the log-sum inequality by [Singer and Warmuth, 1997] in the context of machine learning. We have initiated a specific study of such approximations, in particular by examining the quality of an approximation when using the algorithms we presented for the unambiguous case.

Note that the general problem of determining if a weighted automaton over the $(\mathbb{R}, +, \cdot, 0, 1)$ semiring accepts the full free monoid $\Sigma^*$ is undecidable [Berstel and Reutenauer, 1988]. Here, we are considering the same decidability question but only for probabilistic automata, which form a restricted class of all weighted automata over the $(\mathbb{R}, +, \cdot, 0, 1)$ semiring. However, we conjecture that the problem is in fact undecidable even in this case.
6 Relative Entropy as a Kernel

This  section  examines  the  use  of  the  relative  entropy,  or  its  symmetrized  version, 
in  machine  learning  algorithms.  The  results  hold  in  general  and  are  not  limited 
to  the  particular  case  of  probabilistic  automata. 

In machine learning, functions $K : X \times X \to \mathbb{R}$ are called kernels. A kernel is said to be positive definite symmetric (PDS for short) if it is symmetric, $K(x, y) = K(y, x)$ for all $x, y \in X$, and if for any subset $\{x_1, \ldots, x_m\} \subseteq X$, the eigenvalues of the matrix $[K(x_i, x_j)]_{1 \leq i, j \leq m}$ are non-negative. PDS kernels play an important role in machine learning since they can be combined with discriminant algorithms such as support vector machines (SVMs) to create powerful predictors [Scholkopf and Smola, 2002], the PDS condition ensuring the convergence of training.

In some cases, a symmetric kernel $K$ is not positive definite but $\exp(-\lambda K)$ is PDS for any $\lambda > 0$. $K$ is then said to be negative definite symmetric (NDS). Such kernels are also important since they can be used to define PDS kernels, as in the case of Gaussian kernels.

We will show, however, that the symmetrized relative entropy is neither PDS nor NDS, contrary to what is stated in a number of machine learning papers, which limits its use and application in kernel methods.

The symmetrized relative entropy of two distributions $p$ and $q$ is given by:

$$D_{\mathrm{sym}}(p \| q) = \frac{D(p \| q) + D(q \| p)}{2} = \frac{1}{2} \sum_{x \in X} [p(x) - q(x)] \log \frac{p(x)}{q(x)}. \tag{30}$$

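Theorem 6. The symmetrized relative entropy is not a PDS kernel.

Proof. Let $\{q_1, q_2, \ldots, q_m\}$ be a set of probability distributions over $X$. Consider the Gram matrix $\mathbf{K} \in \mathbb{R}^{m \times m}$ defined by $\mathbf{K}_{ij} = D_{\mathrm{sym}}(q_i \| q_j)$. By definition of $D_{\mathrm{sym}}$, $D_{\mathrm{sym}}(q_i \| q_i) = 0$ for all $i \in [1, m]$, thus $\mathrm{tr}(\mathbf{K}) = 0$. Since the trace is the sum of the eigenvalues, when $\mathbf{K} \neq 0$, this implies that $\mathbf{K}$ admits at least one negative eigenvalue. □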
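The argument can be illustrated numerically without an eigenvalue solver. For two distinct distributions $q_i$ and $q_j$, the vector $v = e_i - e_j$ gives $v^\top \mathbf{K} v = \mathbf{K}_{ii} + \mathbf{K}_{jj} - 2 \mathbf{K}_{ij} = -2 D_{\mathrm{sym}}(q_i \| q_j) < 0$, so the Gram matrix cannot be positive semidefinite. The sketch below assumes strictly positive distributions given as tuples; the function names and sample distributions are arbitrary.

```python
import math


def d_sym(p, q):
    """Symmetrized relative entropy (D(p||q) + D(q||p)) / 2 for positive p, q."""
    return 0.5 * sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))


def gram(dists):
    """Gram matrix K with K[i][j] = d_sym(dists[i], dists[j])."""
    return [[d_sym(p, q) for q in dists] for p in dists]


def quadratic_form(K, v):
    """Value of v^T K v."""
    m = len(v)
    return sum(v[i] * K[i][j] * v[j] for i in range(m) for j in range(m))
```

With the distributions `(0.9, 0.1)`, `(0.5, 0.5)` and `(0.2, 0.8)`, the diagonal of the Gram matrix is zero and `quadratic_form(K, [1, -1, 0])` is negative.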

To show that the symmetrized relative entropy is not an NDS kernel, we use the following property of NDS kernels.

Theorem 7 ([Berg et al., 1984]). Let $K : X \times X \to \mathbb{R}$ be an NDS kernel such that for $x, y \in X$, $K(x, y) = 0$ iff $x = y$. Then, there exist a Hilbert space $H$ and a mapping $\Phi : X \to H$ such that

$$\forall x, y \in X, \quad K(x, y) = \|\Phi(x) - \Phi(y)\|^2. \tag{31}$$

Thus, under the hypothesis of the theorem, $\sqrt{K}$ defines a metric.

Theorem  8.  The  symmetrized  relative  entropy  is  not  an  NDS  kernel. 

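Proof. Note that for any two distributions $p$ and $q$, $D_{\mathrm{sym}}(p \| q) = 0$ iff $D(p \| q) = D(q \| p) = 0$, that is iff $p = q$. Thus, by Theorem 7, if $D_{\mathrm{sym}}$ is an NDS kernel, $\sqrt{D_{\mathrm{sym}}}$ defines a metric. We prove that $\sqrt{D_{\mathrm{sym}}}$ does not obey the triangle inequality, which will show that $D_{\mathrm{sym}}$ is not NDS.

For the sake of simplicity, the proof is given in the case of a universe of events limited to two elements: $X = \{x_1, x_2\}$. Let $0 < \epsilon < 1/3$ with $\epsilon \neq 1/4$, and let $q_1, q_2, q_3$ be the three distributions over $X$ defined by:

$$\forall i \in [1, 3], \quad q_i(x_1) = 1 - i\epsilon \ \text{ and } \ q_i(x_2) = i\epsilon. \tag{32}$$

By definition of the symmetrized relative entropy,

$$D_{\mathrm{sym}}(q_1 \| q_2) = \frac{\epsilon}{2} \log \frac{1 - \epsilon}{1 - 2\epsilon} - \frac{\epsilon}{2} \log \frac{1}{2} = \frac{\epsilon}{2} \log \frac{2(1 - \epsilon)}{1 - 2\epsilon}. \tag{33}$$

Similarly, $D_{\mathrm{sym}}(q_2 \| q_3) = \frac{\epsilon}{2} \log \frac{3(1 - 2\epsilon)}{2(1 - 3\epsilon)}$ and $D_{\mathrm{sym}}(q_1 \| q_3) = \epsilon \log \frac{3(1 - \epsilon)}{1 - 3\epsilon}$. Note that:

$$D_{\mathrm{sym}}(q_1 \| q_3) = \epsilon \log \frac{3(1 - \epsilon)}{1 - 3\epsilon} = 2 \left( \frac{\epsilon}{2} \log \frac{2(1 - \epsilon)}{1 - 2\epsilon} + \frac{\epsilon}{2} \log \frac{3(1 - 2\epsilon)}{2(1 - 3\epsilon)} \right) = 2 \left( D_{\mathrm{sym}}(q_1 \| q_2) + D_{\mathrm{sym}}(q_2 \| q_3) \right). \tag{34}$$

Since $\sqrt{\cdot}$ is strictly concave and $D_{\mathrm{sym}}(q_1 \| q_2) \neq D_{\mathrm{sym}}(q_2 \| q_3)$ for $\epsilon \neq 1/4$,

$$\sqrt{D_{\mathrm{sym}}(q_1 \| q_3)} = 2 \sqrt{\frac{D_{\mathrm{sym}}(q_1 \| q_2) + D_{\mathrm{sym}}(q_2 \| q_3)}{2}} > \sqrt{D_{\mathrm{sym}}(q_1 \| q_2)} + \sqrt{D_{\mathrm{sym}}(q_2 \| q_3)}. \tag{35}$$

This shows that $\sqrt{D_{\mathrm{sym}}}$ does not obey the triangle inequality. □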
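The counterexample can be checked numerically. The sketch below builds $q_i = (1 - i\epsilon, i\epsilon)$ for $i = 1, 2, 3$ and returns the amount by which $\sqrt{D_{\mathrm{sym}}(q_1 \| q_3)}$ exceeds $\sqrt{D_{\mathrm{sym}}(q_1 \| q_2)} + \sqrt{D_{\mathrm{sym}}(q_2 \| q_3)}$. The function names are illustrative, and $D_{\mathrm{sym}}$ is taken as the half-sum of the two relative entropies, as in Equation 30.

```python
import math


def d_sym(p, q):
    """Symmetrized relative entropy (D(p||q) + D(q||p)) / 2 for positive p, q."""
    return 0.5 * sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))


def triangle_gap(eps):
    """sqrt(D13) - (sqrt(D12) + sqrt(D23)) for q_i = (1 - i*eps, i*eps).

    Requires 0 < eps < 1/3 so that q_3 is strictly positive.
    """
    q1, q2, q3 = [(1.0 - i * eps, i * eps) for i in (1, 2, 3)]
    d12, d23, d13 = d_sym(q1, q2), d_sym(q2, q3), d_sym(q1, q3)
    return math.sqrt(d13) - (math.sqrt(d12) + math.sqrt(d23))
```

For $\epsilon = 0.1$ the gap is positive, so the triangle inequality fails. At $\epsilon = 1/4$ the two shorter distances coincide and the gap is zero, which is why the strict inequality needs the two values to differ.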


7  Computation  of  the  Norm  of  a  Probabilistic 
Automaton 

In Section 4, we gave a general algorithm for computing the relative entropy of two unambiguous probabilistic automata by relating this problem to a shortest-distance problem over the appropriate semiring. A special case of that algorithm can be used to compute the entropy of a single unambiguous probabilistic automaton. One may ask if such results could be generalized to the computation of other similar quantities that we will refer to as the norm of an unambiguous probabilistic automaton. This section shows how they can indeed be generalized by considering an arbitrary monoid morphism.

7.1 Computation of the Norm of an Unambiguous Probabilistic Automaton

Let $(\mathbb{K}, \oplus, \otimes, \bar{0}, \bar{1})$ be a closed or $\epsilon$-$k$-closed semiring. Let $\Phi : (\mathbb{R}_+, \cdot, 1) \to (\mathbb{K}, \otimes, \bar{1})$ be a monoid morphism. We will say that $\Phi$ preserves closedness if for all $x$, $0 < x < 1$, $\bigoplus_{n=0}^{\infty} \Phi(x^n)$ is well-defined and in $\mathbb{K}$. For such a morphism, we can define the $\Phi$-norm of a probabilistic automaton as:

$$\|A\|_\Phi = \bigoplus_{x \in \Sigma^*} \Phi([\![A]\!](x)). \tag{36}$$
Theorem 9. Let $(\mathbb{K}, \oplus, \otimes, \bar{0}, \bar{1})$ be a closed or $\epsilon$-$k$-closed semiring and let $\Phi : (\mathbb{R}_+, \cdot, 1) \to (\mathbb{K}, \otimes, \bar{1})$ be a monoid morphism preserving closedness. Then, for any unambiguous probabilistic automaton $A$, $\|A\|_\Phi$ can be computed exactly in time $O(|A|^3)$.

Proof. The automaton $\Phi(A)$ derived from $A$ by replacing each weight $x$ by $\Phi(x)$ is a weighted automaton over the semiring $\mathbb{K}$. Since $A$ is unambiguous, at most one path in $A$, $\pi = e_1 \cdots e_k$, is labeled with any string $x \in \Sigma^*$. Since $\Phi$ is a monoid morphism, $\Phi([\![A]\!](x)) = \bigotimes_{j=1}^{k} \Phi(w[e_j])$, that is the weight of the path labeled with $x$ in $\Phi(A)$. This shows that $\|A\|_\Phi = s[\Phi(A)]$ and proves the theorem. □

Theorem 9 provides an algorithm for computing the $\Phi$-norm of unambiguous probabilistic automata for arbitrary monoid morphisms preserving closedness. We will briefly illustrate two applications of the theorem.

(a) Entropy of a Probabilistic Automaton.

Let $(\mathbb{K}, \oplus, \otimes, (0, 0), (1, 0))$ be the entropy semiring. It is not hard to see that the function $\Phi : (\mathbb{R}_+, +, \cdot, 0, 1) \to (\mathbb{K}, \oplus, \otimes, (0, 0), (1, 0))$ defined by: $\forall x \in \mathbb{R}_+$, $\Phi(x) = (x, -x \log x)$, is a monoid morphism preserving closedness. Thus, the $\Phi$-norm of an unambiguous probabilistic automaton can be computed efficiently using a single-source shortest-distance algorithm. Its second component is exactly the entropy of $A$, thus this provides an efficient and simple algorithm for computing the entropy of $A$.

(b) Norm $L_\alpha$ of a Probabilistic Automaton, $\alpha \in \mathbb{R}_+$.

The function $\Phi : (\mathbb{R}_+, +, \cdot, 0, 1) \to (\mathbb{R}_+, +, \cdot, 0, 1)$ defined by $\Phi(x) = x^\alpha$ is clearly a monoid morphism. Since for $0 < x < 1$, $0 < x^\alpha < 1$, it also preserves closedness. Thus, the $L_\alpha$-norm of an unambiguous probabilistic automaton $A$ can be computed efficiently using a shortest-distance algorithm. In particular, the Bhattacharyya norm, i.e., the $L_{1/2}$-norm, of $A$ can be computed efficiently.
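Application (a) can be illustrated on a toy example. The sketch below checks that $\Phi(x) = (x, -x \log x)$ is multiplicative for the entropy semiring product, and sums $\Phi$ over a few explicitly listed paths to recover their entropy. It assumes the usual entropy semiring operations. The function names and the list-of-paths input, which stands in for an unambiguous automaton with finitely many accepting paths, are hypothetical.

```python
import math

ZERO, ONE = (0.0, 0.0), (1.0, 0.0)


def oplus(a, b):
    """Entropy semiring sum."""
    return (a[0] + b[0], a[1] + b[1])


def otimes(a, b):
    """Entropy semiring product."""
    return (a[0] * b[0], a[0] * b[1] + b[0] * a[1])


def phi(x):
    """Phi(x) = (x, -x log x), with the convention 0 log 0 = 0."""
    return (x, -x * math.log(x)) if x > 0 else ZERO


def norm_from_paths(paths):
    """Sum over paths of the product of phi(w) over each path's weights."""
    total = ZERO
    for weights in paths:
        path_weight = ONE
        for w in weights:
            path_weight = otimes(path_weight, phi(w))
        total = oplus(total, path_weight)
    return total
```

For three paths with transition weights `[0.5, 0.5]`, `[0.5, 0.5]` and `[0.5]`, the path probabilities are 0.25, 0.25 and 0.5. The first component of the result is their total mass and the second is their entropy, $1.5 \log 2$.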

7.2  Computation  of  the  Norm  of  Arbitrary  Automata 

In general, a probabilistic automaton may not be unambiguous. But the $L_p$-norm can still be computed in polynomial time for any integer $p > 1$.

Theorem 10. The $L_p$-norm of a probabilistic automaton $A$ can be computed exactly in $O(|A|^{3p})$ time and $O(|A|^{2p})$ space.

Proof. Let $A^{(p)}$ denote the automaton obtained by intersecting $A$ with itself $p - 1$ times. Then, by definition of intersection, $(s[A^{(p)}])^{1/p}$ represents the $L_p$-norm of $A$. The cost of intersection to create $A^{(p)}$ is in $O(|A|^p)$. Applying the generic Floyd-Warshall algorithm to $A^{(p)}$, whose size is in $O(|A|^p)$, then gives the stated bounds. □
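As a small check of the construction for $p = 2$, consider the one-state automaton of Section 5.1, which assigns weight $\alpha^{|x|} (1 - |\Sigma| \alpha)$ to each string $x$. Assuming that intersection multiplies the weights of matching transitions and final weights, its self-intersection is again a one-state automaton, with $|\Sigma|$ loops of weight $\alpha^2$ and final weight $(1 - |\Sigma| \alpha)^2$, so $s[A^{(2)}] = (1 - |\Sigma| \alpha)^2 / (1 - |\Sigma| \alpha^2)$. The sketch compares this with a brute-force sum of squared string weights; all names and parameter values are illustrative.

```python
from itertools import product


def string_weight(x, alpha, k):
    """Weight alpha^{|x|} (1 - k alpha) of the string x, alphabet size k."""
    return alpha ** len(x) * (1.0 - k * alpha)


def l2_squared_brute_force(alpha, alphabet, max_len):
    """Sum of squared weights over all strings of length at most max_len."""
    k = len(alphabet)
    return sum(
        string_weight(x, alpha, k) ** 2
        for n in range(max_len + 1)
        for x in product(alphabet, repeat=n)
    )


def l2_squared_by_intersection(alpha, k):
    """Shortest distance of the one-state self-intersection."""
    loop_sum = k * alpha ** 2            # k self-loops, each of weight alpha^2
    final = (1.0 - k * alpha) ** 2       # final weight multiplied by itself
    return final / (1.0 - loop_sum)      # final weight times the loop closure
```

The $L_2$-norm itself is the square root of either value. With `alpha = 0.3` and the alphabet `"ab"`, the two functions agree to within the truncation error of the brute-force sum.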
7.3 Approximate Computation

Here we consider the specific case of the computation of the $L_p$-norm of a probabilistic automaton. Our results can be generalized to cover more general cases, in particular the case of unambiguous automata.

Since for any $\epsilon > 0$, a probabilistic automaton is $\epsilon$-$k$-closed for the probability semiring, instead of the (generalized) Floyd-Warshall algorithm, we can use a single-source shortest-distance algorithm to compute $s[A]$, as already described in Section 4.3. This algorithm works with any queue discipline and its space complexity is linear, which makes it significantly more space-efficient than the Floyd-Warshall algorithm. The complexity results and analyses detailed in Section 4.3 apply identically here.


8  Conclusion 

We  presented  an  exhaustive  study  of  the  problem  of  computing  the  relative 
entropy  of  probabilistic  automata. 

Our results demonstrate the benefit of semiring theory for the formulation of the problem, which becomes a single-source shortest-distance problem. This results in the definition of simple but efficient algorithms, both exact and approximate, for the computation of the relative entropy of two unambiguous probabilistic automata or the entropy of a single unambiguous probabilistic automaton. As shown by our experimental results, these algorithms scale to large probabilistic automata of several hundred thousand transitions.

Our algorithms can be adapted straightforwardly to compute the so-called unnormalized relative entropy of two unambiguous probabilistic automata, defined by:

$$D(A \| B) = \sum_{x} \left( [\![A]\!](x) \log \frac{[\![A]\!](x)}{[\![B]\!](x)} - [\![A]\!](x) + [\![B]\!](x) \right), \tag{37}$$

simply by replacing $\Phi_1$ and $\Phi_2$ by $\Phi'_1$ and $\Phi'_2$, where $\Phi'_1(A)$ (resp. $\Phi'_2(A)$) is the weighted automaton over the entropy semiring derived from $A$ by replacing each weight $w$ with the pair $(w, 1)$ (resp. $(w, w)$). The entropy semiring can also be used to give a conceptually simple formulation of the computation of the relative entropy of tree automata and to derive similar computation algorithms.

We proved that the computation of the relative entropy of arbitrary probabilistic automata is PSPACE-complete and thus likely to be intractable. This suggests examining approximate computations of the relative entropy. We have already initiated the study of a natural approximate computation of the relative entropy that extends the results presented in this paper.